Modelling epidemic spread in cities using public transportation as a proxy for generalized mobility trends

We study how public transportation data can inform the modeling of the spread of infectious diseases based on SIR dynamics. We present a model where public transportation data is used as an indicator of broader mobility patterns within a city, including the use of private transportation, walking etc. The mobility parameter derived from this data is used to model the infection rate. As a test case, we study the impact of the usage of the New York City subway on the spread of COVID-19 within the city during 2020. We show that utilizing subway transport data as an indicator of the general mobility trends within the city, and therefore as an indicator of the effective infection rate, improves the quality of forecasting COVID-19 spread in New York City. Our model predicts the two peaks in the spread of COVID-19 cases in NYC in 2020, unlike a standard SIR model that misses the second peak entirely.

www.nature.com/scientificreports/ announced a state of emergency on March 7 2020, followed shortly by the passage of an executive order, known as 'New York State on PAUSE' , that shutdown all non-essential businesses in the state 9, 17 . New York City saw a decline in subway usage alongside various other modes of public transport, including bikeshares 18,19 , and taxis 20 .
As restrictions were slowly eased in the later half of the year, there was a corresponding increase in the usage of public transport, although some modes were preferred over others at different rates than before the pandemic. Bikehares, for example, had recovered to their 2019 levels by September 30, 2020 while subway ridership was at 30% of pre-pandemic levels 18 . Despite the different rates at which different modes recovered, both bikeshares and the subway saw increases in usage in the latter half of 2020. We believe that this increase in both modes of transport corresponds to the underlying increase in human mobility as restrictions were loosened after June 8, 2020. This motivates us to use this directly measurable traffic as a proxy for all traffic in the city. We collected and analyzed the subway turnstile data of New York City for 12 consecutive months, starting from January 2020 to December 2020. The MTA publishes turnstile data on a weekly basis, which includes administrative information such as the control area, unit number, station name and line name, as well as the counts of the entries and exits at a specific time for a particular turnstile 16 . The system collects these counts every four hours, each of which is a cumulative register value. The data were first converted into dataframes and then into a combination of control area code, remote unit, subunit channel position (SCP), as well as the time of the observation that serves as a unique ID to identify and remove duplicate records. We removed entries related to Port Authority Trans-Hudson (PATH) trains, since they do not represent the mobility among NYC boroughs. The absolute difference between the first and last counts at a turnstile on a particular day defines the number of subway riders passing through that particular turnstile. The geographic coordinates of the station and the borders of each borough allow us to place each station in its corresponding borough. We calculated the total number of borough-level subway riders by summing the numbers of riders at all the turnstiles of all subway stations located within each borough. To estimate the mobility between boroughs, we used a survey that was conducted among subway riders regarding the origin and destination of their trips 21 . Given the number of departures at a given station, we used probabilities extracted from the survey to determine the destinations of those trips. COVID-19 data. We used publicly available data published by the NYC government about the number of new COVID-19 cases reported for each day and for each of the five boroughs 22 . Figure 1a shows that this data has a clear weekly cyclical pattern. Figure 1b shows that this pattern arises because of the much smaller numbers of cases that were reported on weekends than on weekdays. We remove this pattern by using a running 7-day average of the number of daily cases.
We also chose to restrict our analysis to 2020, since the introduction of vaccines in early 2021 decreases the number of people susceptible to infection. To account for this decrease on the spread of epidemics would require the introduction of another parameter that might change the quantitative effect of mobility on the spread of the disease.
Population data. All population data for New York City were taken from the 2020 census conducted by the US Census Bureau and published on their website 23 .

Model
We start with the well-established SIR model 24,25 . While more powerful models for modelling disease spread exist, such as the SEIR model 26 , we picked the SIR model in order to reduce the number of parameters and avoid overfitting. The COVID-19 hospitalization data that we use only reports daily newly infected cases and we do not believe this data is fine-grained enough to justify the use of a more complex model. The SIR model divides the total population (N) into susceptible (S), infected (I), and recovered or dead (R) compartments. The equations governing the spread of the disease are where β and γ are the infection and recovery rate, respectively. We modify the model by dividing the total population into subpopulations called regions, each with a fraction of the population p j living in region j, thus j p j = 1 . When we apply the model to New York City the different p j represent the populations of the five boroughs of New York City, normalized by the total population of the city, which was 8,804,190 in 2020 23 . Within a region, people will have different infection rates based on their activity. The infection rate for individuals working from home and following strict quarantine protocols will be lower than the rate for frontline workers. Each activity-based cohort has an associated infection rate β jc . Additionally, people may also have access to different quality of healthcare, which may impact the frequency of testing and the likelihood of visiting a doctor. Both of these parameters influence the recovery rate of a patient. Each healthcare-based cohort has an associated recovery rate γ je . The fraction of the total population in region j with behavior c and healthcare e is denoted as p jce . For parameters representing the population fractions, an omitted index indicates a sum over all values of this index, so p jc = e p jce and p j = c,e p jce etc.
Each cohort within each region follows SIR dynamics. The equations governing the population fractions of susceptible, infected and recovered individuals are given by: where s jce (t), i jce (t) and r jce (t) are the population fractions of susceptible, infected and recovered individuals, respectively at time t.
Inter-region mixing. So far our model follows straightforward SIR dynamics. We now want to introduce inter-region mixing through a mixing parameter that tells us the population fraction of one region that is visiting another region at a given time. In order to calculate this quantity, we need to know the origin, destination, and trip duration for every rider using the subway. From the data, we only know the borough of departure. We do not know an individual rider's destination based just on the borough they departed from. Using an MTA survey on the use of the NYC subway, we can determine the probabilities of a trip departing and terminating at different boroughs 21 .
In order to determine the average time spent visiting a borough, we also need to know the borough of origin of riders arriving at a station. From the survey, we know P(o j ) , the probability of any trip originating in borough j, P(d j ) , the probability of any trip terminating in borough j, and P(d j ′ |o j ) , the probability that a trip originating in borough j terminates in borough j ′21 .
It will be helpful to define a fractional time, τ , which measures the time of day as the fraction of the day that has passed since midnight. So, for example, 3 PM corresponds to a fractional time of τ = 0.625 . In order to determine the average duration that residents of one borough spend in another borough, we start by treating NYC as a closed system where individuals do not travel into or out of the city, and all residents of a borough return to it at the end of the day. If a rider k leaves borough j at fractional time τ A k and returns at τ D k , then the fraction of the day spent away from the borough is If there are M t total riders on day t, then the average fraction of the day spent away from the borough on that day is This average can be rewritten by collecting all the arrival and departure times separately rather than tracking each rider's individual arrival and departures so that . The subway turnstile data does not track the arrival and departure of individual riders. Instead, it provides a number of snapshots everyday of the cumulative arrivals and departures. So, the data instead provides us with the number of arrivals, A t,j (τ k ) , and departures, D t,j (τ k ) , at fractional time τ k , where the index k no longer refers to riders, www.nature.com/scientificreports/ but to the different times at which the number of entries and exits are recorded. We can then write the average fractional time spent by residents of borough j away from their home borough where t tot is the total number of days. It should be noted that the sum over k is no longer over the number of riders, but over the number of snapshots of total entries and exits taken that day. For a variety of reasons, such as travel into and out of the city and the usage of multiple modes of transport, the number of arrivals and departures at a station will not match exactly. In order to account for this, we match the number of arrivals and departures at a given snapshot in time in the data, and any discrepancy is added to the next time bin. Once all time periods have been accounted for, any unmatched arrivals or departures are ignored. The equation then becomes where where U A t (τ k ) and U D t (τ k ) are the unmatched arrivals and departures from the previous time period. An example of the matching process is shown in Table 1. We can now write our mixing parameter On any given day we could estimate the number of people, expressed as a fraction of the total population, that This would give us an estimate of how many of the people leaving borough j are heading towards j ′ . However, we do not have detailed temporal resolution on the movement of riders within a borough and we do not know when any individual rider returns to their home borough. We define an effective population of visitors by multiplying this quantity with �τ j , the estimate of the average time fraction spent away from borough j, that spend the entire day in borough j ′ . The mixing parameter represents this effective visiting population.
The population fraction that leaves region j for all other regions is f − j = j ′ � =j f jj ′ , while the population fraction that arrives at region j from all other regions is N . Table 1. An example of the matching process using real data. The columns labelled ENTRIES, EXITS, and TIME are from the turnstile data. We calculate the number of arrivals, A t,j (τ ) , by subtracting successive values of the running total of entries. These arrivals are assigned a fractional time, τ , corresponding to the midpoint of successive time snapshots. The departures, D t,j (τ ) , are calculated in the same way.

Entries Exits
Time www.nature.com/scientificreports/ We must now keep track of the part of the susceptible and infected populations of region j that do not leave the region, which we call 'stationary' , given by It should be noted that while this 'stationary' population does not leave the borough, the individuals that constitute this population may still be mobile within their borough. This will be addressed later in this section. We also keep track of infected individuals visiting region j from other regions. These are given by We can now write down the equations for the stationary susceptible and infected populations for region j: We also need to track individuals from region j who are visiting region j ′ . These are given by These individuals will interact with stationary infected individuals from other regions. We can now write the equations for the individuals from region j visiting all other regions We can combine the equations for the stationary and visiting populations by introducing a flow parameter The flow parameter lets us compactly write the dynamics of region j Since the data provided by the NY government tracks the number of newly reported cases and does not report the number of active cases (i(t) in our model) we construct the quantity where t is in units of days. In other words, i new jce (t) represents the number of new cases reported on day t and this is the quantity that we will fit to the data. Figure 2a shows a schematic representation of our mobility-based SIR model. Introducing a public transportation node. While our model accounts for the spread of disease through the transit of infected individuals between regions, it does not take into account that use of public transportation poses a higher risk of infection 3 . To account for this effect, we introduce a public transportation node, denoted by the index T. The fraction of the population permanently residing on this node is 0 ( p T = 0 ). We modify our model so that all riders travelling to another borough spend some part of their time at node T. This duration is taken from the average commute time reported by riders of each borough 21 . The mixing parameter from node j to node T becomes where �τ jT is the commute time for riders in node t, expressed as a fraction of the day. Due to the introduction of a transport node we must also modify our expression for the inter-borough mixing parameter, which becomes (23) s jce (t + �t) =s jce (t) − jce (t)�t, r jce (t + �t) =r jce (t) + i jce (t)γ je �t. www.nature.com/scientificreports/ Figure 2b shows a schematic depiction of the model with the public transportation node. The introduction of such a node allows us to independently model the infection rate during rides on public transportation systems, β T , for individuals using public transportation. Our model does not track individual interactions, but only the infection rate at the population level. We introduce the transport node to model the different rates of infection experienced by the fraction of the population that uses the subway, where they interact with a different mixture of populations than the mixture that they encounter in the boroughs in which they live and work.

Mobility-dependent infection rate.
While inter-region mixing and the introduction of a public transportation node account for mobility between regions, we also need to account for mobility within a region. To do this, we introduce a mobility parameter for each region, m j (t) , which represents the extent to which individuals are moving within the region. We then write our infection rate as where β 0 jce is the static infection rate. For the particular case of the NYC subway, m j (t) is calculated by taking a 7-day moving average of the total trips that start in borough j and rescaling this quantity by dividing it by the maximum number of trips taken in one day in borough j in this training period, thereby scaling it between 0 and 1. A plot of the average mobility parameter, defined as m avg (t) = j p j m j (t) , is plotted in Fig. 4b. We are using the level of subway usage as a stand-in for all short-range mobility. We found that the number of bike-share rides taken during the pandemic was correlated with the number of subway trips 19 . We assume that subway usage is correlated with all mobility within the city, even as subway usage fell during the pandemic across cities around the world 27 .

Results
While our model is able to incorporate complex demographic information such as healthcare status, access to testing, and public policies regarding gathering sizes and mask usage, we are limited by the data to which we have access. Since we only have public transportation data and the daily case count, we will assume that each region in our model, corresponding to one of the five boroughs of NYC, has a uniform demographic distribution. This means that we will be ignoring the c and e indices in our model. In order to model the effect of different policies, we pick March 22, 2020, the official start day of the NY PAUSE Program, as the beginning of the lockdown. We assume that there are two different infection rates, one before and the other after this date. This assumption is made because the PAUSE program marks the start of the implementation of widespread mask usage and social distancing. These are non-mobility factors which impact the overall infection rate.
We also assume that the intensity of usage of public transportation is correlated with the infection rate. The infection rate for borough j then becomes where β p (t) = β h before the start of the NY PAUSE Program on March 22 2020, and β p (t) = β l afterwards. The second term, m j (t − t D ) , represents the normalized daily number of trips taken on the subway within a region. The parameter t D accounts for the population level delay between subway usage and the subsequent increase in β jce (t) = β 0 jce m j (t), Figure 2. (a) Schematic representation of the mobility-based SIR model. Each region j has an associated infection rate β j and mobility parameters f jj ′ and f j ′ j , which represent individuals from region j visiting region j ′ and vice versa. (b) The enhanced model that includes a public transportation node without a permanent population. Inter-region mixing still occurs as in the basic model, but the visiting populations of every region pass through the transportation node for the duration of their commute time during which they are exposed to the higher infection rate associated with using public transportation. The effective population of region j that is commuting is given by f jT while the effective population of all other regions that are visiting region j are given by f + j . www.nature.com/scientificreports/ Covid-19 cases. We also average β j (t) over a 7 day moving window in order to smooth out abrupt changes due to both noise in m j (t) and the discontinuous transition in β p (t).
We have the values of f jj ′ (t) and m j (t) from the data, Using these two values, we can construct f jT . We need to learn the values of β h , β l , γ and τ D . We also need to learn the values of β h T and β l T , the infection rate on the subway before and after the start of the PAUSE program. Figure 3a plots the results of fitting the model by minimizing the mean squared error (MSE). We fit our model by sweeping over a million values for these parameters, with our search guided by existing literature on the infection and the recovery rates 28 . By contrast, we can see from Fig. 3b that an SIR model that does not take into account mobility cannot explain the infection trend.
Forecasting. We also masked the last three weeks of data and trained our model without this period. First, we do a parameter sweep to find the values of the parameters that best fit the training data. Next, we use the end of the training period, i new data (t train ) (where t train , is the last day of the training data) as the initial condition for the testing period. However, we cannot directly use the number of daily new cases as the initial condition. Instead, the model requires knowledge of the active infected and total recovered cases, i(t train ) and r(t train ) , at the end of the training period as the initial conditions for the testing period. This was not a concern when we were fitting our model for the training period since we assumed that i(0) = 1/N and r(0) = 0 . In order to predict the spread of the disease in the testing period, however, we need to know these quantities to serve as the initial conditions for our model. While data are available for the number of active and recovered cases for New York state, they are not available for New York City. We estimate the total recovered population by dividing the cumulative deaths reported in New York City by the state-wide case mortality rate 29 where r est (t) is the estimated total recovered population (which includes both individuals who have died as well as those who have recovered from the disease) expressed as a fraction of the total population of New York City. We also need to know i est (t) , the estimated total active number of infected cases. Specifically, we only need to estimate i est (t train ) , the total number of active cases on the last day of the training data. We do this by searching for a value of i est (t train − 1) such that using i est (t train − 1) and r est (t train − 1) as the initial condition for our model and predicting the daily number of cases for the next day gives us where i new (t train ) are the daily number of new cases output by our model on the last day of the training data. The corresponding values of i est (t train ) and r est (t train ) become the initial conditions for the model at the start of the testing period. The model's prediction is shown in Fig. 4a. Table 2 shows the parameters that minimize the MSE with and without a testing period. Figure 1a shows that the total number of cases in NYC rapidly increased after the discovery of the first recorded case, followed by a decline and then a second rise. This trend seems to follow the usage of the subway: initially, the usage of the subway declines precipitously and then it slowly and partially recovers to about 2/3 of the previous usage.

Discussion
Cumulative deaths reported in NYC on day t N * Case mortality rate on day t , www.nature.com/scientificreports/ By scaling our infection parameter with subway usage, we are simultaneously capturing two effects. The first is the rise in infections directly due to the use of the subway, either through higher infection rate or through case importation between regions. The second is taking subway usage as a proxy for broader mobility trends, which in turn depend upon public policy that governs the infection rate. As more people went back to work and as restrictions on public gatherings, schools etc. were eased, we assume that there was a corresponding increase in human mobility proportional to the increase in subway usage, even though this usage is just one form of the total population mobility in the city.
The fact that our model is able to accurately capture both the first wave of infections as well as the second one indicates that our assumption that subway usage is an indicator for broader human mobility trends (and for public policies regarding restrictions more generally) within the city is correct. While our model does predict a higher infection rate for the subway than for the boroughs, infection trends are much less sensitive to inter-region mobility compared to intra-region mobility.
If we set β j (t) = β p (t) and the dependence on m j (t) is removed, the reduced model is unable to capture the second wave of infections towards the end of the year as shown in Fig. 3b.

Limitations.
The turnstile data that we use imposes some limitations on our model. The most crucial assumption in our work is that subway usage is correlated to all mobility within the city and can therefore be used as a proxy for all mobility. This assumption is supported by the fact that the usage of both bikeshares and The ratio of the average infection rate, β avg (t) = j p j β j (t) , over the recovery rate, γ as a function of time. The second axis shows the average mobility parameter, m avg (t) = j p j m j (t). Table 2. The results from fitting the data with and without the last three weeks masked. E in refers to the MSE of fitting the data, while E out shows the MSE of the model's prediction during the testing period. While we minimize the MSE during the fitting process, the table also reports the in-sample and out of sample R 2 score of the best fit. We use t = 10 −2 for all our simulations. www.nature.com/scientificreports/ taxis dropped at the same time as that of the subway 19,20 , and bikeshare usage increased in the same period as subway usage, although at a much faster rate 18 . Additional data on other forms of mobility, specially in the latter half of 2020, would allow us to construct the mobility parameter that encapsulates multiple modes of transport. We also assume that the residents of a borough that leave it using the subway return to the home borough using the subway on the same day. This assumption impacts our inter-region mixing parameter through the calculation �τ j , the average time spent away from the home borough.
Finally, we assume that the fraction of cases that were reported remained constant throughout 2020. While we have adjusted for the drop in reporting on the weekends by taking a 7-day moving average, the fraction of cases that were reported may have changed over the course of the year due to other factors as well. A possible effect of this variation in the reporting rate is the very high ratio of the average infection rate, defined as β avg (t) = j p j β j (t) , to the recovery rate, γ , that our model predicts during the beginning of the infection, shown in Fig. 4b. The precipitous rise in cases at the start of the pandemic may represent a slew of people getting tested in a short amount of time as awareness of the epidemic spread and widespread testing became available, rather than accurately representing the true spread of the disease. After this initial period our β avg (t)/γ ratio has a minimum of 1.43 and a maximum of 5.25. While an initial estimate for the reproduction number was reported to be 2.2 in Wuhan 30 , other studies using SIR models have reported a much higher reproduction number ranging from a global estimate of 4.5 31 to some regions having a value as high as 7.8 32 . While the ratio β avg (t)/γ is not equivalent to the reproduction number (due to the connectivity of the different compartments in our model) and should be seen only as a crude estimate, it is encouraging that the ratio predicted by our model falls within the range of estimates reported in the literature.

Conclusion
The main contribution of this paper is an introduction of a mobility-based model of epidemic spread that uses a mobility-dependent infection rate. Based only on fitting the data, our model confirms that subway usage is correlated with the usage of other forms of public transportation because using it as a proxy for the short-range mobility parameter allows us to predict the two peaks in the NYC infection rate in 2020. Using this model and the turnstile data from the NYC subway, we predict the trend of daily infections in NYC for a three-week period. Our model accounts for inter-region mixing of populations, and uses an infection rate that is dependent on the short-range mobility within a region.
While we have used NYC as a test case, it would be interesting to verify the model with data from other cities. We believe that by incorporating data from other public transportation services, such as taxis, ride-and bikesharing services etc., our model can offer more accurate predictions about the spread of an epidemic disease. Thus, it can be a useful tool in guiding public policies to tame the spread of pandemics.